3. Search systems

Today's topics

Information retrieval systems

Interfaces of information search

Conclusions

Long history of information retrieval (IR) research

No new research topic on text search

New algorithms for images and movies

Difficult to find a song

Information retrieval interfaces are far from ideal

Incremental search

Word suggestion / completion

Human life is the series of information searches

Most of our everyday life are related to searches

We seldom create information from scratch

Example 1: Sending an email message

Find the email application

Find the "compose" button

Find the destination address

Enter the message

Find attachment files

Find the "send" button

Example 2: various hobbies

Find a nice restaurant

Find an interesting TV program

Find an interesting movie

Find interesting news articles

Find a music

Find interesting exhibitions

Example 3: GUI widgets

Menus

Find programs and functions

Icons

Find programs and data

Folders

Find data based on hierarchical data structure

Scrollbar

Find something out of sight

History of information retrieval

Text search

Internet search

Human-powered search

Integrated search

The age of full-text search (〜1995)

「Information Retrieval」

ACM SIGIR (Special Interest Group on Information Retrieval)

Ordinary people never used search systems

Research was matured

Multimedia search (videos, music) not popular

Dawn of Internet(〜2000)

IR researchers began creating search engines

Lycos, AltaVista, Inktomi, ...

Ordinary people begay using text search systems

Search by human power(〜2010)

Google!

Use page ranks

Use the power of people

Social filtering

Social bookmarks

Search integration (～2020)

Various attributes used for search

location, time, etc.

Find information from vague keys

New search algorithms

SVM, LSI, ...

New AI-based search algorithms (～2030?)

Text search techniques

Data size and search methods

Huge

Use indexes

Small

Pattern matching

Middle

Combination of both

Methods for large text database

Long research history

Information Retrieval (IR)

ACM SIGIR (Special Interest Group on Information Retrieval)

Index database is created beforehand

Database search

Text search

Precision and recall

Precision

Correct results / All search results

Recall

Correct results / All correct results

Precision and recall

http://masui.sfc.keio.ac.jp/dbee8fd0f7e8f192bc024f4d49e9d4e9.graffle http://gyazo.com/aa28558496ec2d2ad76d23619ce71a43.png

Both value should be 100% ideally

Various tradeoffs

F value = $ 2 \over {{ 1 \over {Recall}} + {1 \over {Precision}}}

"harmonic mean" of precision and recall

Boolean search

Check if a word is included in a text

Use word-document matrix

http://gyazo.com/7e30f277ce8b72bdb57a37f9fde7c63b.png

Boolean search

Pros

AND/OR/NOT search available

e.g. Brutus AND Caesar but NOT Calpurnia

Cons

Huge matrix required

Ranking impossible

Inverted index

Compressing word-document matrix

https://gyazo.com/28862fde6cc350e2d73aa08958ae81ca

Handling words

Extract words from documents

Fairly easy in English

Difficult in Japanese

Normalize

Process variations (go, going, went ...)

Stemming

Remove "stop words" (prepositions, articles, etc.)

to be or not to be

Vector space mode

Represent a document as a word vector

Use inner product for calculating similarity

Word-document matrix

http://gyazo.com/5d8bf7d946262da0b78887494b276900.png

Vector space model

Add weights to words based on its frequency in documents

tf-idf algorithm

$ \bold{tf}: Term Frequency

$ \bold{idf}: Inverse Document Frequency

tf-idf

$ \bold{tfidf}_{ij} = \bold{tf}_{ij} \times \bold{idf}_iweight of $ iin document$ j

$ \bold{tf}_{ij} =$ \bold{n}_{ij} \over \sum_k{\bold{n}_{kj}}

$ \bold{n_{ij}} =frequency of word $ \bold{t}_iin document $ \bold{d_j}

this value is large if a term appears more times than other term

if $ \bold{d}_1is a b c c d e f, $ \bold{tf}_{c1}is$ 2/7、$ \bold{tf}_{a1}is$ 1/7

if $ \bold{d}_2is a b c d e f g h$ \bold{tf}_{c2}is $ 1/8、$ \bold{tf}_{a2}is $ 1/8

$ \bold{idf}_i = \log(\bold{N} / \bold{df}_i)

$ \bold{N} = total number of documents

$ \bold{df}_i =number of documents that include $ \bold{t}_i

becomes small if many documents include the term

number of documents is 10 and 2 of them include c, $ \bold{idf}_c = log(10/2) = 1.6 $ \bold{tfidf}_{c1} = 0.460 $ \bold{tfidf}_{c2} = 0.201

number of documents is 10 and 4 of them include a,$ \bold{idf}_a = log(10/4) = 0.91 $ \bold{tfidf}_{a1} = 0.131 $ \bold{tfidf}_{a2} = 0.114

As a result, if a term is frequently used in a specific document, tf-idf value becomes larger

Association search

Proposed by Prof. Takan at NII

https://gyazo.com/d58fe4fa74b11089a324873f52277a1d

Vector space search + modern interface

Using GETA library

GETA

Word-document matrix (WAM: Word-Article Matrix) handling library

Association search library

Clustering library

Paperbacks map

https://gyazo.com/b22f90d8499368e7ee030f118f3164d6

https://gyazo.com/f4b2e26a6b1cc9ef0c069dbb90facf49

想-IMAGINE Book Search

https://gyazo.com/3b96031051cbd5ffd233352961722d43

Relevance Feedback

Use the search results for next search

Can be manual or automatic

Use the best search result for next search

Use the words in the top result for next search

Search results improved

Problems of relevance feedback

Users don't send feedbacks

Longer query

Users are not interested recall ratio

Query expansion

http://gyazo.com/e696d55b51d26c7016a48c7e11ac976a.png

Use thesaurus and search log

Useful on Web keyword search

Probability model

Modeling the word sequences based on probability

Calculate the probability of a document matching a query word

Based on theory

tf-idf calculation is based on heuristics

LSI (Latent Semantic Indexing)

Use "singular value decompositions" (SVD) of word-document matrix

SVD calculation

Convert a word-document matrix to smaller-rank matrix

Smaller calculation + higher recall

Small-scale text search

Scan full text with text search algorithms

Use pattern match algorithms

Pattern matching algorithms

Perfect match algorithm

Regular expression

Approximate pattern matching

Perfect matching algorithm

Find a pattern P[1..m] in text T[1..n]

Time-consuming if characters are checked one by one from the left

Faster algorithms

Knuth-Morris-Pratt (KMP) method

Boyer-Moore method

Shifter algorithm

Simple algorithm

Compare T[1..m]and P[1..m]

T: aaaaaaaaaaaaaaaaab

P: aaaaaab

Compare T[2..m+1] and P[1..m]

T: aaaaaaaaaaaaaaaaab

P: aaaaaab

...

Compare T[n-m+1..n] and P[1..m]

$ n \times m comparesons required

Knuth-Morris-Pratt algorithm

Faster algorithm invented by Donald Knuth, et al.

P[7] (= b)で不一致が検出されたとき

前のテキストはaaaaaaであることが既知

P[6] (= a)の比較をやり直せばよい

不一致が検出されたときに比較をやりなおすべきPの文字の位置をあらかじめ「失敗関数」として計算

無駄な比較を繰り返さないようにする

http://gyazo.com/1d7b988f5207f5d70182f576cd28e525.png

Example of Knuth-Morris-Pratt algorithm

P = abcaba の場合

http://gyazo.com/981d1b79dceedfa85cf5995ea44be3b4.png

5 → 1 の遷移

5の状態でテキストの最後がabcaであることが既知

→2の状態まで戻れる

5から6に遷移できない場合

次の文字はbでない

→1の状態まで戻れる

Boyer-Moore algorithm

Fastest text comparison algorithm

Number of character comparisons can be smaller than the string length

All the characters should be checked in KMP method

Check the pattern from the right

1回目の文字比較でT[m] = c であることがわかる

cはPに含まれていないので2回目の比較はT[m+1]の先から調べればよい

Since P does not contain c, second check can be performed from T[m+1]

http://gyazo.com/c0ef58ec05df6c96c8591907363c7b2c.png

Boyer-Moore algorithm (Cont'd)

右からk文字だけ一致したときにその次にどこから比較を再開すればよいかをあらかじめパタンから計算

Use two pre-calculated tables

d1[] -- 右端で不一致が検出されたときの、移動数[不一致文字]の表

d2[] -- 右からk文字目で不一致が検出されたときの、移動数[不一致位置]の表

先の例の場合、d1['a'] = 1, d2['c'] = d1['d'] = ... = 8である。

"Shifter" algorithm

Represent the matching state by a $ mbit pattern

$ kth bit represents $ k character match

e.g. 11101 = mathing 2 characters

When the next character is $ c, left-shift this value one bit, and calculate the next state by OR with T[c]

T[c]はあらかじめ計算しておく値で、パタン文字==cの位置だけ0になっている。

T[c] is a pre-calculated pattern array

e.g. pattern abac → T['a'] = 0101

Shifter algorithm (Cont'd)

Sometimes faster than Boyer-Moore algorithm

Used in a utility agrep

Asearch

Approximate search library by Masui

https://gyazo.com/ae40ec58ecbabb3935f47f531340894f

Available on Ruby and Node

https://rubygems.org/gems/asearch/versions/0.1.0

https://github.com/shokai/node-asearch

Regular expression

Powerful expression for specifying text patterns

Ambiguous matching patterns

Became popular on Unix tools

Regular expression patterns

Selection

abc or def

Character class

a or b or c

Repetition

more than 3 a's

Examples of regular expression

abc

(abc|def)

[abc]

ab*c

Examples of regular expression

(時計|時間|時刻)を([0-9]*)時に(セットする|設定する|あわせる)

ファイルを((きっちり(きっちり)?)?)消す

Generative grammer

Grammatical rules for generating sentences

Terminal symbols (e.g. "mountain") and non-terminal symbols (e.g. noun)

noun ← "mountain"

adjective ← "high"

noun phrase ← adjective noun

Regular grammar

A grammer representing regular expression

Simplest grammer in Chomsky hierarchy

Only the form A ← aB (A/B: nonterminal, a: terminal)

Handling regular expression

Convert to state transition machine

e.g. a(b|cd) and its grammer and state transition machine

A ← aS

B ← bA

C ← cA

B ← dC

http://gyazo.com/ef47894d05a71d085fb31cdd9527910e.png

Pattern matching algorithms for regular expression

Aho-Corasick method

grep method

egrep method

Theory of regular expression algorithms

https://www.amazon.co.jp/dp/B07JHRL2NS https://gyazo.com/a7cf80318f7d193f583a6f9dbeb36562

Aho-Corasick method

Understand (string1|string2|...) pattern

Extention to KMP algorithm

Used in fgrep command

Exmple: ac|baa|bacd|ba|bb

http://gyazo.com/7926b6c03cc66c5f883d4aa2a4a14da7.png

grep algorithm

Used in grep command in Unix

Recursively call functions that take care of *、?, etc.

Cannot handle patterns like (abc|def)

Described in "Software Tools"

egrep algorithm

Used in egrep command in UNIX

Can handle all regular expressions

Algorithm

Convert a regular expression to NDFA(Non-deterministic finite automaton)

Convert NDFA to DFA (Deterministic finite automaton)

Create a state transition table of DFA

Fast pattern matching

Example of NDFA

pattern (SUB)*SECTION

http://gyazo.com/2df5627603e51e8f04ed465eda1e9f4b.png

Convert NDFA to DFA

http://gyazo.com/d9e5779fe46637b3fedf7783e0e2d1d3.png

Generated DFA

http://gyazo.com/c74f45af19228955d3a50cb796d6781b.png

(Original NDFA)

http://gyazo.com/2df5627603e51e8f04ed465eda1e9f4b.png

What a regular expression cannot do

Cannot count numbers

No state is used

Cannot handle approximate matching

Middle-level pattern matching

Don't use large index

"Signature method"

Also know as "BitFunnel"

Express a word as a bit pattern

「express」→ 0000000100000000

「word」→ 0000000000001000

Use or for the combination

「express word」→ 0000000100001000

Use and operation between search keywords and document pattern

Information retrieval and direct manipulation

Direct manipulation

Contiinuous

Reversible

Concrete

Information retrieval based on direct manipulation

Result changes as soon as search condition changes

Go back to the previous state by undo

Not popular even today

Incremental search

Search result changes as soon as keyword changes

Emacs search

Norikae-annai (changing trains)

Google Suggest

Norikae-annai

http://gyazo.com/f0e800ae8648adac94c020bd61951a79.png

Demo: Norikae-an'nai

Migemo

Japanese incremental search on Emacs

http://gyazo.com/de4be34b79fe5fab36f765428f69c806.png

Demo: Migemo

Limitation of text search

Are you satisfied with Google search?

Good search keyword required

Difficult to search from vague information

Good restaurant

Interesting book

Google is just a keyword search system

Other services for specific needs

Search methods other than keyword-based search

Page rank

Link-based search

Zooming search

Location-based search

Approximate pattern matching

Faceted search

Page rank

Invented by Larry Page (Founder of Google)

Use link information for calculating weights

Page rank becomes large if linked from important pages

Basically based on human power

Larry Page

https://gyazo.com/5a7ea84315f9ec3569afb46f070aa3d3

Page rank calculation

https://gyazo.com/43e0c52c1f5ab35dab98ed2b2ba40d90

Ｘ＝ＺＸis receiving the only link fromＺ

Ｙ＝X/2 Y is receiving one of the two links fromＸ

Ｚ＝X/2＋ＹＺ is receiving one of the two links from Ｘ, and receiving the only link from Y

⇒ X : Y : Z = 0.4 : 0.2 : 0.4

Link-based search

What is this?

http://gyazo.com/03435fbba38fe55a3572342c2bde1118.png

I could find it through the grapevine

How I found the info

http://gyazo.com/03435fbba38fe55a3572342c2bde1118.png

Draw one pattern

⇒ Ask people if they know it

⇒ One person said it is "Hana-chiru-sato"

⇒ Found it on google

Link-based search

/masui/芋蔓検索のすすめ

/masui/リンク情報の活用

Neighbor-based search

Put similar information together

Similar date, content, foldeer, etc.

http://gyazo.com/49ff2980552894951bb1a1e0e12368a7.png

Naboring info

http://gyazo.com/9ff211e2c2f30b87fe047ecff7bd0f49.png

Link-based search on Scrapbox

https://gyazo.com/2c70348a630a341146bc9fcba90b3111 https://scrapbox.io/UIPedia/Scott%20R.%20Klemmer

dshelf

Browsing contents only using arrow keys

Listed books are similar in some sense

https://gyazo.com/21eb8e60eb236cc47eb96118b6fe501a.png

Demo: dshelf

Zooming search

Use information visualization technique

https://gyazo.com/2f8eb8956a9ba8c8838a9c23a47718d2

Location-based search

Powerful, but not popular yet

Photoloc

https://gyazo.com/65e2fe004ba378b99a265c14b681249d

Zooming photo search

https://gyazo.com/b545e286d2692b42ade2ec90b11af33f http://www.pitecan.com/tmp/RainbowZoomer/photobrowser2.html#%E3%83%AF%E3%82%A4%E3%83%B3

"Do you mean?" search

Sometimes used in Google search

Faceted search

Categories (facets) are provided beforehand

Users can select facets for

Simply reduce the size of target database

Popular on various EC sites

Document by Marti Hearst

http://gyazo.com/13d047dd83955b6d5ea69bdae0390aea.png http://scope.knowledge-works.co.jp/2012/06/%E3%83%87%E3%82%A3%E3%83%AC%E3%82%AF%E3%83%88%E3%83%AA%E6%A4%9C%E7%B4%A2-vs-%E3%83%95%E3%82%A1%E3%82%BB%E3%83%83%E3%83%88%E6%A4%9C%E7%B4%A2/

Demo:

https://gyazo.com/f1b22415869c2dcc430f084475f25c2a

Helpfeel

https://gyazo.com/883080ab4c36aa6200dd691e6a619e6f https://helpfeel.notainc.com/SFCHelp/?q=

Use the data on /SFCHelp

Helpfeel

https://youtu.be/Nbwh7KrfIVA

https://youtu.be/QjiNaYfQ4Ak

Helpfeel taxi ad

https://vimeo.com/539549975

Other information retrieval systems

Image search

Video search

Audio search

Future of search interface

Integration of new search algorithms and interaction techniques]

Integration of search / input / registration